Data Description¶
Data Source: Determinants of Airbnb prices in European cities: A spatial econometrics approach (Supplementary Material)
The dataset includes different determinants of Airbnb prices across different European cities (London, Rome and Budapest) during the weekdays and weekends. For each city, the dataset includes 19 variables that were determined via spatial econometric analysis methods.
| Variable Name | Data Type | Description |
|---|---|---|
| realSum | numerical | price in Euros for a 2 night stay for 2 people |
| room_type | categorical | type of room (e.g., entire home/apt, private room, shared room) |
| room_shared | boolean | is the room shared? |
| room_private | boolean | is the room private? |
| person_capacity | numerical | maximum guest capacity in the room |
| host_is_superhost | boolean | does the host have superhost status? |
| multi | binary | does property have multiple listings? |
| biz | binary | is property hosted for business purposes? |
| cleanliness_rating | numerical | cleanliness rating of the listing |
| guest_satisfaction_overall | numerical | overall guest satisfaction rating |
| bedrooms | numerical | number of bedrooms |
| dist | numerical | distance from the city centre in km |
| metro_dist | numerical | distance from the nearest metro station in km |
| attr_index | numerical | attraction index of airbnb location |
| attr_index_norm | numerical | normalised attraction index of airbnb location (0-100) |
| rest_index | numerical | restaurant index of airbnb location |
| rest_index_norm | numerical | normalised restaurant index of airbnb location (0-100) |
| lng | numerical | longitude of airbnb location |
| lat | numerical | latitude of airbnb location |
The number of observations (rows) per city is:
- London: 4614 (weekdays) and 5379 (weekends)
- Rome: 4492 (weekdays) and 4535 (weekends)
- Budapest: 2074 (weekdays) and 1948 (weekends)
Question of Interest¶
We want to examine the association between the price of Airbnb listings (response) and three groups of predictors: the location of the Airbnb (e.g., distance from the city centre, distance from the nearest metro station, attraction index), the European city it is in (London, Budapest, or Rome), and the part of the week (weekday or weekend).
Response Variable: realSum
Explanatory Variables (not limited to): dist, metro_dist, attr_index, attr_index_norm, rest_index, rest_index_norm
Primary Focus: Prediction – we want to be able to predict how much an Airbnb listing should be priced to help Airbnb hosts price their listings competitively. Although our main focus will be prediction, inference will help us develop a better understanding of how price is affected by the different variables and could tell us what Airbnb customers may look for in an accommodation.
# load libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(httr)
library(readr)
library(car)
library(glmnet)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ✔ lubridate 1.9.3 ✔ tidyr 1.3.1 ✔ purrr 1.0.2 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors Loading required package: carData Attaching package: ‘car’ The following object is masked from ‘package:dplyr’: recode The following object is masked from ‘package:purrr’: some Loading required package: Matrix Attaching package: ‘Matrix’ The following objects are masked from ‘package:tidyr’: expand, pack, unpack Loaded glmnet 4.1-8
# read datasets using URL
url <- "https://www.kaggle.com/api/v1/datasets/download/thedevastator/airbnb-prices-in-european-cities"
file <- "airbnb_data.zip"
GET(url, write_disk(file, overwrite = TRUE))
unzip(file, exdir = "airbnb_data")
# read data for individual cities
london_weekday <- read_csv("airbnb_data/london_weekdays.csv")
london_weekend <- read_csv("airbnb_data/london_weekends.csv")
rome_weekday <- read_csv("airbnb_data/rome_weekdays.csv")
rome_weekend <- read_csv("airbnb_data/rome_weekends.csv")
budapest_weekday <- read_csv("airbnb_data/budapest_weekdays.csv")
budapest_weekend <- read_csv("airbnb_data/budapest_weekends.csv")
Response [https://storage.googleapis.com:443/kaggle-data-sets/2919695/7809961/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20250328%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20250328T012043Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=b3c4212db609209300b638da94daf575c9dde2d2117aacfd03b2a91e0312bd652658e7508ad53db7b08fab074f987935992509e1663b6d074e4cf5cc53ab9af161b984ab7df51e2cfcacbd09c8f5904901a4bb12267f43493e2ad6d29709df7ef6558a3e9404f5fd46fd5c30c285e09e9a300d5c22bee3dd316871dfe36cabe905776fd7f66ce78d4b1eea1add38bbf07afa01c974b8b75d7d56ab8fb73fe7419c27ea3d1b1833303df81017003f8c0003c64f836dcc68a4076cb65e0b752b2fa374d60a069fc4efb370dfe2ab05f93fba750a2d29d48faed7cad75e645854458dd1562ae89f9a61823b144881727d514408523d7af2b1089214014e8d9ffc6f] Date: 2025-03-28 01:20 Status: 200 Content-Type: application/zip Size: 4.1 MB <ON DISK> /home/jovyan/work/stat-301/project/airbnb_data.zipNULL
New names: • `` -> `...1` Rows: 4614 Columns: 20 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): room_type dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu... lgl (3): room_shared, room_private, host_is_superhost ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. New names: • `` -> `...1` Rows: 5379 Columns: 20 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): room_type dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu... lgl (3): room_shared, room_private, host_is_superhost ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. New names: • `` -> `...1` Rows: 4492 Columns: 20 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): room_type dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu... lgl (3): room_shared, room_private, host_is_superhost ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. New names: • `` -> `...1` Rows: 4535 Columns: 20 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): room_type dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu... lgl (3): room_shared, room_private, host_is_superhost ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. 
New names: • `` -> `...1` Rows: 2074 Columns: 20 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): room_type dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu... lgl (3): room_shared, room_private, host_is_superhost ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. New names: • `` -> `...1` Rows: 1948 Columns: 20 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): room_type dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu... lgl (3): room_shared, room_private, host_is_superhost ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# add city variable and weekday/weekend variable
# London Airbnbs
london_weekday['city'] <- 'London'
london_weekday['weekend_weekday'] <- 'Weekday'
london_weekend['city'] <- 'London'
london_weekend['weekend_weekday'] <- 'Weekend'
# Rome Airbnbs
rome_weekday['city'] <- 'Rome'
rome_weekday['weekend_weekday'] <- 'Weekday'
rome_weekend['city'] <- 'Rome'
rome_weekend['weekend_weekday'] <- 'Weekend'
# Budapest Airbnbs
budapest_weekday['city'] <- 'Budapest'
budapest_weekday['weekend_weekday'] <- 'Weekday'
budapest_weekend['city'] <- 'Budapest'
budapest_weekend['weekend_weekday'] <- 'Weekend'
# combine into 1 dataframe
london_df <- merge(london_weekday, london_weekend, all.x = TRUE, all.y = TRUE)
rome_df <- merge(rome_weekday, rome_weekend, all.x = TRUE, all.y = TRUE)
budapest_df <- merge(budapest_weekday, budapest_weekend, all.x = TRUE, all.y = TRUE)
airbnb_df <- merge(london_df, rome_df, all.x = TRUE, all.y = TRUE) %>% merge(budapest_df, all.x = TRUE, all.y = TRUE)
head(airbnb_df)
| | ...1 | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | ⋯ | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | city | weekend_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <chr> | <lgl> | <lgl> | <dbl> | <lgl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
| 1 | 0 | 121.1223 | Private room | FALSE | TRUE | 2 | FALSE | 0 | 0 | 6 | ⋯ | 5.7341167 | 0.4370940 | 222.8822 | 15.493414 | 470.0885 | 8.413765 | -0.04975 | 51.52570 | London | Weekend |
| 2 | 0 | 156.8747 | Private room | FALSE | TRUE | 2 | TRUE | 1 | 0 | 10 | ⋯ | 2.9784678 | 1.5957331 | 281.1639 | 6.230648 | 697.7272 | 15.191486 | 12.48654 | 41.92498 | Rome | Weekday |
| 3 | 0 | 172.7725 | Private room | FALSE | TRUE | 2 | FALSE | 0 | 0 | 10 | ⋯ | 1.2225824 | 0.3977605 | 550.0784 | 12.187232 | 1075.4121 | 23.430621 | 12.50181 | 41.88987 | Rome | Weekend |
| 4 | 0 | 238.9905 | Entire home/apt | FALSE | FALSE | 6 | TRUE | 0 | 1 | 10 | ⋯ | 0.3593550 | 0.3526430 | 404.4047 | 24.116552 | 893.4773 | 67.656853 | 19.05074 | 47.50076 | Budapest | Weekday |
| 5 | 0 | 332.0487 | Entire home/apt | FALSE | FALSE | 6 | TRUE | 0 | 1 | 10 | ⋯ | 0.3593723 | 0.3526618 | 404.3985 | 24.136091 | 893.4182 | 78.100790 | 19.05074 | 47.50076 | Budapest | Weekend |
| 6 | 0 | 570.0981 | Entire home/apt | FALSE | FALSE | 2 | FALSE | 0 | 0 | 10 | ⋯ | 5.3010178 | 1.5889904 | 209.6326 | 14.571793 | 467.5975 | 8.372724 | -0.16032 | 51.46531 | London | Weekday |
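Because the weekday and weekend files share the same columns and differ in `weekend_weekday`, the full outer join above behaves like a row stack, only with the rows re-sorted by the join keys (which is why the first rows shown come from different cities). The sketch below illustrates this on two toy data frames with made-up values (`toy_weekday` and `toy_weekend` are hypothetical stand-ins, not the real files). `rbind()` is used to keep the example in base R; `dplyr::bind_rows()` would be the tidyverse equivalent.

```r
# Toy stand-ins for a weekday and a weekend file (values are made up)
toy_weekday <- data.frame(realSum = c(120.5, 80.0), dist = c(1.2, 3.4))
toy_weekend <- data.frame(realSum = c(130.0, 95.5), dist = c(1.2, 3.4))

toy_weekday$weekend_weekday <- "Weekday"
toy_weekend$weekend_weekday <- "Weekend"

# Stack the rows directly; column names and types must match
stacked <- rbind(toy_weekday, toy_weekend)

# Full outer join on all shared columns: same rows, but sorted by the keys
merged <- merge(toy_weekday, toy_weekend, all = TRUE)

nrow(stacked)
nrow(merged)
```

Stacking keeps the original row order and states the intent (append rows) more directly than a join does.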
# check for missing values
missing_val_count <- sum(is.na(airbnb_df))
missing_val_count
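A single total does not show which columns the missing values come from. As a small sketch on a made-up data frame (`toy` is hypothetical, not the Airbnb data), `colSums(is.na(...))` gives a per-column count:

```r
# Toy data frame with one missing value in column b (values are made up)
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, 5, 6),
  c = c("x", "y", "z")
)

# Count missing values separately for each column
missing_per_col <- colSums(is.na(toy))
missing_per_col
```

Applied to `airbnb_df`, the same call would return one count per variable.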
Pre-selection of variables¶
I noticed that there is a column titled ...1 that acts as a row identifier within each original file. I am going to remove it because the values are no longer unique, and therefore irrelevant, now that the different datasets have been merged into one.
There are also multiple columns in the dataset that encode the same information: room_type, room_shared, room_private. I chose to remove the room_shared and room_private columns because the information they carry is already conveyed by the room_type variable.
# drop the index column '...1' and the redundant 'room_shared' and 'room_private' columns
airbnb_df <- select(airbnb_df, -(c('...1', room_shared, room_private)))
head(airbnb_df)
| | realSum | room_type | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | city | weekend_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <chr> | <dbl> | <lgl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
| 1 | 121.1223 | Private room | 2 | FALSE | 0 | 0 | 6 | 69 | 1 | 5.7341167 | 0.4370940 | 222.8822 | 15.493414 | 470.0885 | 8.413765 | -0.04975 | 51.52570 | London | Weekend |
| 2 | 156.8747 | Private room | 2 | TRUE | 1 | 0 | 10 | 95 | 1 | 2.9784678 | 1.5957331 | 281.1639 | 6.230648 | 697.7272 | 15.191486 | 12.48654 | 41.92498 | Rome | Weekday |
| 3 | 172.7725 | Private room | 2 | FALSE | 0 | 0 | 10 | 93 | 1 | 1.2225824 | 0.3977605 | 550.0784 | 12.187232 | 1075.4121 | 23.430621 | 12.50181 | 41.88987 | Rome | Weekend |
| 4 | 238.9905 | Entire home/apt | 6 | TRUE | 0 | 1 | 10 | 99 | 1 | 0.3593550 | 0.3526430 | 404.4047 | 24.116552 | 893.4773 | 67.656853 | 19.05074 | 47.50076 | Budapest | Weekday |
| 5 | 332.0487 | Entire home/apt | 6 | TRUE | 0 | 1 | 10 | 99 | 1 | 0.3593723 | 0.3526618 | 404.3985 | 24.136091 | 893.4182 | 78.100790 | 19.05074 | 47.50076 | Budapest | Weekend |
| 6 | 570.0981 | Entire home/apt | 2 | FALSE | 0 | 0 | 10 | 98 | 1 | 5.3010178 | 1.5889904 | 209.6326 | 14.571793 | 467.5975 | 8.372724 | -0.16032 | 51.46531 | London | Weekday |
# check observations in categorical levels
# roomtype per city
rm_type_per_city_count <- airbnb_df %>%
group_by(city) %>%
count(room_type)
rm_type_per_city_count
| city | room_type | n |
|---|---|---|
| <chr> | <chr> | <int> |
| Budapest | Entire home/apt | 3589 |
| Budapest | Private room | 419 |
| Budapest | Shared room | 14 |
| London | Entire home/apt | 4384 |
| London | Private room | 5559 |
| London | Shared room | 50 |
| Rome | Entire home/apt | 5561 |
| Rome | Private room | 3454 |
| Rome | Shared room | 12 |
From the table above, the number of observations with room_type = "Shared room" is very low in comparison to the other levels (76 listings in total). If we kept these observations in the dataset, we could run into problems when splitting the data into training and testing sets: with so few cases, all the "Shared room" data points for a city might fall into the testing set. The model could not make predictions for these points because it would not have been trained on any data where room_type = "Shared room". Thus, I will drop this categorical level and remove the corresponding data points.
# remove observations that include room_type = "Shared room"
clean_airbnb_df <- airbnb_df %>% filter(room_type != "Shared room")
rm_type_per_city_count_2 <- clean_airbnb_df %>%
group_by(city) %>%
count(room_type)
rm_type_per_city_count_2
| city | room_type | n |
|---|---|---|
| <chr> | <chr> | <int> |
| Budapest | Entire home/apt | 3589 |
| Budapest | Private room | 419 |
| London | Entire home/apt | 4384 |
| London | Private room | 5559 |
| Rome | Entire home/apt | 5561 |
| Rome | Private room | 3454 |
Because our question of interest uses realSum as the response variable, let's look at the range of the values of this variable.
# check min and max prices of airbnbs
max_price <- max(clean_airbnb_df$realSum)
min_price <- min(clean_airbnb_df$realSum)
cat("highest price: ", max_price, "lowest price: ", min_price)
highest price: 15499.89 lowest price: 34.77934
#check average prices of airbnbs per city
avg_prices <- clean_airbnb_df %>%
group_by(city, weekend_weekday) %>%
summarise("avg_price" = mean(realSum))
avg_prices
`summarise()` has grouped output by 'city'. You can override using the `.groups` argument.
| city | weekend_weekday | avg_price |
|---|---|---|
| <chr> | <chr> | <dbl> |
| Budapest | Weekday | 168.5781 |
| Budapest | Weekend | 185.3225 |
| London | Weekday | 361.1261 |
| London | Weekend | 365.3278 |
| Rome | Weekday | 201.7924 |
| Rome | Weekend | 209.2436 |
head(clean_airbnb_df)
| | realSum | room_type | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | city | weekend_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <dbl> | <chr> | <dbl> | <lgl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <chr> |
| 1 | 121.1223 | Private room | 2 | FALSE | 0 | 0 | 6 | 69 | 1 | 5.7341167 | 0.4370940 | 222.8822 | 15.493414 | 470.0885 | 8.413765 | -0.04975 | 51.52570 | London | Weekend |
| 2 | 156.8747 | Private room | 2 | TRUE | 1 | 0 | 10 | 95 | 1 | 2.9784678 | 1.5957331 | 281.1639 | 6.230648 | 697.7272 | 15.191486 | 12.48654 | 41.92498 | Rome | Weekday |
| 3 | 172.7725 | Private room | 2 | FALSE | 0 | 0 | 10 | 93 | 1 | 1.2225824 | 0.3977605 | 550.0784 | 12.187232 | 1075.4121 | 23.430621 | 12.50181 | 41.88987 | Rome | Weekend |
| 4 | 238.9905 | Entire home/apt | 6 | TRUE | 0 | 1 | 10 | 99 | 1 | 0.3593550 | 0.3526430 | 404.4047 | 24.116552 | 893.4773 | 67.656853 | 19.05074 | 47.50076 | Budapest | Weekday |
| 5 | 332.0487 | Entire home/apt | 6 | TRUE | 0 | 1 | 10 | 99 | 1 | 0.3593723 | 0.3526618 | 404.3985 | 24.136091 | 893.4182 | 78.100790 | 19.05074 | 47.50076 | Budapest | Weekend |
| 6 | 570.0981 | Entire home/apt | 2 | FALSE | 0 | 0 | 10 | 98 | 1 | 5.3010178 | 1.5889904 | 209.6326 | 14.571793 | 467.5975 | 8.372724 | -0.16032 | 51.46531 | London | Weekday |
Visualization¶
Now that the data has been cleaned and wrangled, let's visualise the datapoints using a scatterplot. Here I am interested in the following variables: realSum, metro_dist, city, weekend_weekday. I want to know how the price is affected by these variables. Using colour and shape, I differentiated the cities and whether it is a weekend or weekday listing.
# plot a scatterplot
options(repr.plot.width = 15, repr.plot.height = 10) # set the plot size in the notebook
scatter_price_dist <- clean_airbnb_df %>%
ggplot(aes(x = metro_dist, y = realSum, colour = city, shape = weekend_weekday)) +
geom_point(alpha = 0.3) +
coord_cartesian(xlim = c(0, 9), ylim = c(0, 4000)) +
facet_wrap(~weekend_weekday) +
geom_hline(data = avg_prices, aes(yintercept = avg_price, colour = city)) +
ggtitle("Visualization 1: Price of Airbnbs vs Distance from Nearest Metro Station") +
xlab("Distance from Nearest Metro Station (km)") +
ylab("Price of Airbnb Listing (Euros)")
scatter_price_dist
The graph above is zoomed in to the region where most datapoints are concentrated: realSum between 0 and 4000 and a distance of at most 9 km from the nearest metro station. Listings outside this window are not shown. The horizontal lines indicate the mean prices for each city on weekends and on weekdays.
Analysing the visualization¶
Looking at the graph, we see that the most expensive Airbnbs can be found in London, as shown by the green points in graph scatter_price_dist. Overall, the plot suggests a negative association between the prices of Airbnb listings and the distance from the nearest metro station. So, based on the graphs, we can say that generally, the further away a listing is from a metro station, the lower the price of the listing.
Weekend vs Weekday – Using facet_wrap(~weekend_weekday), we can compare the scatterpoints of Airbnb listings during the weekend and during weekdays. We see that both panels follow very similar patterns. From the horizontal lines, we can see that the prices for Airbnbs during the weekend are slightly higher than on weekdays, but the difference is minimal.
Cities – At first glance, the London listings seem to be the most prominent compared to Rome and Budapest. This is expected as the dataset contains more London Airbnbs than Airbnbs in either of the other 2 cities. The average prices in Rome and Budapest are fairly close, with Rome having a slightly greater average (roughly 202 to 209 euros versus 169 to 185 euros), while London's averages are much higher (roughly 361 to 365 euros).
Main Takeaway from Visualization – This visualization suggests that there is a negative relationship between the price of Airbnbs and distance to metro stations. Airbnb listings are priced higher when they are closer to a metro station in these European cities. Additionally, whether an Airbnb is a weekend or weekday listing seems to have a lesser impact on price, as the points are similarly dispersed in both panels. So, in relation to our question of interest, this plot is important because it shows us a possible key determinant of Airbnb prices.
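A scatterplot alone makes it hard to judge how strong the negative association is. One way to quantify it is a rank (Spearman) correlation per city, which does not assume a linear relationship. The sketch below defines a hypothetical helper, `cor_by_group()`, and runs it on made-up toy values. It is not output from the Airbnb data.

```r
# Hypothetical helper: Spearman correlation between x and y within each group
cor_by_group <- function(df, x, y, group) {
  sapply(
    split(df, df[[group]]),
    function(d) cor(d[[x]], d[[y]], method = "spearman")
  )
}

# Toy data (made-up values) with the same column names as the notebook
toy <- data.frame(
  city = rep(c("A", "B"), each = 5),
  metro_dist = c(0.1, 0.5, 1, 2, 3, 0.2, 0.4, 0.9, 1.5, 2.5),
  realSum = c(400, 350, 300, 200, 150, 210, 220, 190, 180, 170)
)

result <- cor_by_group(toy, "metro_dist", "realSum", "city")
result
```

On the real data the call would be `cor_by_group(clean_airbnb_df, "metro_dist", "realSum", "city")`. Values close to 0 would indicate that the visual impression of a negative trend is weak.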
Methods and Plan & Computational Code and Output¶
Method of interest: Multi-covariate Linear Regression with Interaction
I will use multi-covariate linear regression to predict the price of an Airbnb listing in a given city (i.e., Budapest, London, or Rome), because the price is affected by many different variables, including metrics related to location, capacity and ratings. A linear model is a reasonable starting point because the response variable realSum is a continuous numerical variable. Since I am interested in predicting prices based on the cities, city will be a key predictor in my model. The model will include interaction terms between city and every other predictor, so that the effect of each variable on price is allowed to differ between cities. Several predictors are also likely to be associated with one another: for example, person_capacity is related to the number of bedrooms, and guest_satisfaction_overall is likely correlated with cleanliness_rating.
Assumptions:
- A linear relationship between the response variable realSum and the predictors is present. From Visualization 1 above, there seems to be a nonlinear relationship between realSum and metro_dist. Introducing interaction terms between covariates and applying log transformations may help to address this.
- Errors are independent because we assume that we have a random sample of Airbnb listings.
- The error terms are normally distributed conditional on the predictors. Because we have a fairly large sample size of 22,966, normality of errors may not be necessary to have valid inference results.
- Error terms should have equal variance for the p-values and confidence intervals to be valid.
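To illustrate why a log transformation of the response can help with the linearity and equal-variance assumptions, here is a minimal sketch on simulated data. The data-generating values (intercept 5.5, slope -0.15, noise sd 0.5) and the helper `skew()` are assumptions made for illustration; they are not estimates from the Airbnb data.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Simulate right-skewed prices that decline with distance on the log scale
n <- 500
metro_dist <- runif(n, min = 0, max = 5)
price <- exp(5.5 - 0.15 * metro_dist + rnorm(n, sd = 0.5))

# Fit the same predictor against the raw and the log-transformed response
fit_raw <- lm(price ~ metro_dist)
fit_log <- lm(log(price) ~ metro_dist)

# Simple sample skewness of residuals (0 means symmetric)
skew <- function(r) mean((r - mean(r))^3) / sd(r)^3

skew_raw <- skew(resid(fit_raw))
skew_log <- skew(resid(fit_log))
c(raw = skew_raw, log = skew_log)
```

In this simulation the residuals of the log model are far more symmetric than those of the raw model. If `log(realSum)` were used as the response in the notebook, predictions would need to be transformed back with `exp()` before computing errors in euros.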
Potential limitations and weaknesses:
With multiple linear regression, I need to be aware of multicollinearity, which can inflate standard errors and make coefficients less interpretable. During the EDA stage, I removed room_shared and room_private because they carried the same information as room_type. There may be other variables that contribute to multicollinearity, so I will use VIF to check for highly correlated predictors.
Additionally, my model assumes linear relationships between predictors and realSum, but exploratory analysis suggests that some variables (e.g., metro_dist) may exhibit nonlinear trends. I will consider transformations or interaction terms to better capture these effects.
Moreover, Airbnb prices are influenced by other factors not included in the original dataset such as available amenities and time of year. Exclusion of these other variables may result in bias, especially if these missing variables are correlated with other predictors present.
Computational Code and Output¶
# splitting data into train and test sets
airbnb_train <- clean_airbnb_df %>%
slice_sample(prop = 0.8)
airbnb_test <- clean_airbnb_df %>%
anti_join(airbnb_train)
Joining with `by = join_by(realSum, room_type, person_capacity, host_is_superhost, multi, biz, cleanliness_rating, guest_satisfaction_overall, bedrooms, dist, metro_dist, attr_index, attr_index_norm, rest_index, rest_index_norm, lng, lat, city, weekend_weekday)`
# check split size
cat("Training Set Size:", nrow(airbnb_train), "\n")
cat("Test Set Size:", nrow(airbnb_test), "\n")
Training Set Size: 18372 Test Set Size: 4594
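Two caveats apply to the split above: without a seed the sample changes on every run, and `anti_join()` on all columns would also drop from the test set any row that is an exact duplicate of a training row. The sizes printed above add up to 22,966, so no rows were lost here. As a minimal sketch of an index-based alternative on toy data (the seed value 301 and the data frame `toy` are arbitrary assumptions):

```r
set.seed(301)  # arbitrary seed so the split is reproducible

# Toy data frame standing in for clean_airbnb_df (values are made up)
toy <- data.frame(id = 1:10, realSum = seq(100, 550, by = 50))

# Sample 80% of the row positions for training
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]  # every row not sampled, duplicates included

nrow(toy_train)
nrow(toy_test)
```

Splitting by row position guarantees that the two sets are disjoint and together cover every row, regardless of duplicated values.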
# fit model with training data
MLR_model <- lm(formula = realSum ~ city * ., data = airbnb_train)
summary(MLR_model)
Call:
lm(formula = realSum ~ city * ., data = airbnb_train)
Residuals:
Min 1Q Median 3Q Max
-1048.9 -57.9 -18.2 28.2 14104.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.216e+04 1.761e+04 0.690 0.48998
cityLondon -1.443e+03 1.878e+04 -0.077 0.93874
cityRome -3.016e+04 2.005e+04 -1.504 0.13249
room_typePrivate room -5.019e+01 1.864e+01 -2.693 0.00709 **
person_capacity 1.206e+01 5.223e+00 2.310 0.02091 *
host_is_superhostTRUE -1.110e+01 1.136e+01 -0.977 0.32861
multi -2.135e+01 1.313e+01 -1.625 0.10409
biz -3.847e+00 1.297e+01 -0.296 0.76686
cleanliness_rating 8.405e+00 8.543e+00 0.984 0.32525
guest_satisfaction_overall -8.259e-01 1.116e+00 -0.740 0.45930
bedrooms 2.699e+01 9.472e+00 2.849 0.00439 **
dist 1.306e+00 8.194e+00 0.159 0.87335
metro_dist -5.723e+00 1.286e+01 -0.445 0.65631
attr_index -2.949e+01 1.633e+02 -0.181 0.85665
attr_index_norm 4.964e+02 2.736e+03 0.181 0.85604
rest_index -2.474e-01 4.418e-01 -0.560 0.57559
rest_index_norm 3.000e+00 5.384e+00 0.557 0.57737
lng 3.026e+01 2.949e+02 0.103 0.91826
lat -2.661e+02 3.238e+02 -0.822 0.41131
weekend_weekdayWeekend -1.891e+00 2.380e+01 -0.079 0.93667
cityLondon:room_typePrivate room -1.497e+02 2.040e+01 -7.340 2.22e-13 ***
cityRome:room_typePrivate room 1.147e+01 2.086e+01 0.550 0.58263
cityLondon:person_capacity 1.945e+01 6.540e+00 2.975 0.00293 **
cityRome:person_capacity 1.260e+00 6.545e+00 0.193 0.84728
cityLondon:host_is_superhostTRUE 3.279e+01 1.477e+01 2.220 0.02641 *
cityRome:host_is_superhostTRUE 1.165e+01 1.386e+01 0.841 0.40053
cityLondon:multi 9.528e+00 1.568e+01 0.608 0.54351
cityRome:multi 2.279e+01 1.590e+01 1.433 0.15176
cityLondon:biz -8.405e+00 1.559e+01 -0.539 0.58984
cityRome:biz 7.306e+00 1.625e+01 0.450 0.65297
cityLondon:cleanliness_rating -9.302e+00 9.566e+00 -0.972 0.33087
cityRome:cleanliness_rating -8.488e+00 1.051e+01 -0.808 0.41937
cityLondon:guest_satisfaction_overall 2.313e+00 1.201e+00 1.925 0.05424 .
cityRome:guest_satisfaction_overall 1.526e+00 1.289e+00 1.184 0.23642
cityLondon:bedrooms 1.530e+02 1.187e+01 12.895 < 2e-16 ***
cityRome:bedrooms 1.656e+01 1.234e+01 1.342 0.17952
cityLondon:dist 1.015e+01 8.640e+00 1.174 0.24023
cityRome:dist -1.924e+00 8.782e+00 -0.219 0.82662
cityLondon:metro_dist -1.379e+01 1.356e+01 -1.017 0.30934
cityRome:metro_dist 6.402e+00 1.435e+01 0.446 0.65556
cityLondon:attr_index 2.328e+03 2.206e+03 1.055 0.29133
cityRome:attr_index -6.154e+01 2.439e+02 -0.252 0.80078
cityLondon:attr_index_norm -3.355e+04 3.176e+04 -1.056 0.29091
cityRome:attr_index_norm 3.615e+03 8.622e+03 0.419 0.67499
cityLondon:rest_index 1.399e+01 9.185e+01 0.152 0.87893
cityRome:rest_index -2.320e+01 2.788e+01 -0.832 0.40525
cityLondon:rest_index_norm -7.708e+02 5.131e+03 -0.150 0.88059
cityRome:rest_index_norm 1.075e+03 1.280e+03 0.840 0.40085
cityLondon:lng -2.316e+02 3.019e+02 -0.767 0.44297
cityRome:lng -2.533e+01 3.333e+02 -0.076 0.93941
cityLondon:lat 5.367e+01 3.476e+02 0.154 0.87729
cityRome:lat 6.938e+02 3.865e+02 1.795 0.07264 .
cityLondon:weekend_weekdayWeekend 2.519e+01 2.722e+01 0.926 0.35464
cityRome:weekend_weekdayWeekend -1.226e-01 2.797e+01 -0.004 0.99650
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 293 on 18318 degrees of freedom
Multiple R-squared: 0.2701, Adjusted R-squared: 0.268
F-statistic: 127.9 on 53 and 18318 DF, p-value: < 2.2e-16
# check for multicollinearity using VIF
vif(MLR_model) %>% round(4)
there are higher-order terms (interactions) in this model consider setting type = 'predictor'; see ?vif
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| city | 6.168284e+13 | 2 | 2802.4708 |
| room_type | 1.803590e+01 | 1 | 4.2469 |
| person_capacity | 9.950300e+00 | 1 | 3.1544 |
| host_is_superhost | 5.314400e+00 | 1 | 2.3053 |
| multi | 8.117600e+00 | 1 | 2.8491 |
| biz | 8.270600e+00 | 1 | 2.8759 |
| cleanliness_rating | 1.536770e+01 | 1 | 3.9202 |
| guest_satisfaction_overall | 2.410430e+01 | 1 | 4.9096 |
| bedrooms | 6.623100e+00 | 1 | 2.5735 |
| dist | 9.665080e+01 | 1 | 9.8311 |
| metro_dist | 3.528090e+01 | 1 | 5.9398 |
| attr_index | 3.452707e+08 | 1 | 18581.4623 |
| attr_index_norm | 1.744563e+08 | 1 | 13208.1914 |
| rest_index | 1.296269e+04 | 1 | 113.8538 |
| rest_index_norm | 1.456337e+03 | 1 | 38.1620 |
| lng | 1.073485e+06 | 1 | 1036.0914 |
| lat | 4.277705e+05 | 1 | 654.0417 |
| weekend_weekday | 3.027740e+01 | 1 | 5.5025 |
| city:room_type | 6.200650e+01 | 2 | 2.8061 |
| city:person_capacity | 3.129641e+02 | 2 | 4.2060 |
| city:host_is_superhost | 8.036500e+00 | 2 | 1.6837 |
| city:multi | 1.980030e+01 | 2 | 2.1094 |
| city:biz | 2.492990e+01 | 2 | 2.2345 |
| city:cleanliness_rating | 5.334962e+04 | 2 | 15.1979 |
| city:guest_satisfaction_overall | 8.847390e+04 | 2 | 17.2466 |
| city:bedrooms | 1.003659e+02 | 2 | 3.1652 |
| city:dist | 1.447825e+03 | 2 | 6.1685 |
| city:metro_dist | 1.323315e+02 | 2 | 3.3917 |
| city:attr_index | 3.256864e+19 | 2 | 75543.9884 |
| city:attr_index_norm | 2.005986e+19 | 2 | 66924.0101 |
| city:rest_index | 1.822657e+16 | 2 | 11619.2001 |
| city:rest_index_norm | 1.822298e+16 | 2 | 11618.6288 |
| city:lng | 1.584825e+07 | 2 | 63.0951 |
| city:lat | 4.450921e+13 | 2 | 2582.9290 |
| city:weekend_weekday | 3.121691e+02 | 2 | 4.2034 |
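The largest values in the table belong to attr_index, attr_index_norm, rest_index, rest_index_norm and their interactions with city. One plausible explanation, stated here as an assumption rather than something verified against the source files, is that each normalised index is a within-city rescaling of the raw index. If so, the two columns are almost perfectly collinear once the model lets their slopes vary by city. The sketch below reproduces that situation on simulated data (the cities "A" and "B" and the rescaling rule are hypothetical):

```r
set.seed(1)  # arbitrary seed for the simulation

# Simulated raw index for two hypothetical cities with different ranges
toy <- data.frame(
  city = rep(c("A", "B"), each = 50),
  attr_index = c(runif(50, 50, 500), runif(50, 100, 2000))
)

# Assumed normalisation: rescale to 0-100 relative to the maximum in each city
toy$attr_index_norm <- ave(
  toy$attr_index, toy$city,
  FUN = function(x) 100 * x / max(x)
)

# Correlation between raw and normalised index within each city
within_city_cor <- sapply(
  split(toy, toy$city),
  function(d) cor(d$attr_index, d$attr_index_norm)
)

# Correlation when the cities are pooled together
pooled_cor <- cor(toy$attr_index, toy$attr_index_norm)

within_city_cor
pooled_cor
```

Within each city the correlation is 1, while the pooled correlation is lower. In a model with city interactions it is the within-city relationship that matters, so keeping only one of each raw/normalised pair would be a natural step to try.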
# use ridge to reduce high multicollinearity
airbnb_X_train <- model.matrix(object = realSum ~ city * ., data = airbnb_train)[, -1]
airbnb_Y_train <- airbnb_train$realSum
airbnb_X_test <- model.matrix(object = realSum ~ city * ., data = airbnb_test)[, -1]
airbnb_Y_test <- airbnb_test$realSum
cv_ridge_model <- cv.glmnet(x = airbnb_X_train, y = airbnb_Y_train, alpha = 0)
cv_ridge_model
Call: cv.glmnet(x = airbnb_X_train, y = airbnb_Y_train, alpha = 0)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 13 100 86476 21492 53
1se 7697 31 107602 23499 53
best_lambda <- cv_ridge_model$lambda.min
best_lambda
ridge_final <- glmnet(x = airbnb_X_train, y = airbnb_Y_train, alpha = 0, lambda = best_lambda)
ridge_coefs <- coef(ridge_final, s = best_lambda)
ridge_coefs
54 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -7.275602e+01
cityLondon -1.736851e+01
cityRome -2.922555e+00
room_typePrivate room -6.089186e+01
person_capacity 1.233826e+01
host_is_superhostTRUE -8.885998e+00
multi -1.253954e+01
biz 2.731605e+00
cleanliness_rating 1.560157e+00
guest_satisfaction_overall 9.417247e-01
bedrooms 4.624269e+01
dist 3.910691e+00
metro_dist -5.100026e+00
attr_index 3.392743e-02
attr_index_norm 2.698976e+00
rest_index 7.640806e-03
rest_index_norm -8.701638e-02
lng 8.866412e-01
lat -1.404539e-01
weekend_weekdayWeekend 1.459653e+01
cityLondon:room_typePrivate room -1.311924e+02
cityRome:room_typePrivate room 2.025431e+01
cityLondon:person_capacity 2.131494e+01
cityRome:person_capacity -2.232379e-01
cityLondon:host_is_superhostTRUE 3.067100e+01
cityRome:host_is_superhostTRUE 6.143731e+00
cityLondon:multi -4.238407e-01
cityRome:multi 1.362125e+01
cityLondon:biz -1.710014e+01
cityRome:biz 1.471694e+00
cityLondon:cleanliness_rating 6.437983e-02
cityRome:cleanliness_rating -7.912097e-01
cityLondon:guest_satisfaction_overall 4.341831e-02
cityRome:guest_satisfaction_overall -1.228294e-02
cityLondon:bedrooms 1.181335e+02
cityRome:bedrooms -5.483241e-01
cityLondon:dist 1.446334e+00
cityRome:dist -4.818921e+00
cityLondon:metro_dist -8.726242e+00
cityRome:metro_dist 5.530701e+00
cityLondon:attr_index 1.755025e-01
cityRome:attr_index -3.602553e-03
cityLondon:attr_index_norm 2.459794e+00
cityRome:attr_index_norm -9.783322e-02
cityLondon:rest_index 8.779068e-03
cityRome:rest_index 1.244899e-02
cityLondon:rest_index_norm 4.681783e-01
cityRome:rest_index_norm 5.747371e-01
cityLondon:lng -1.706205e+02
cityRome:lng 1.147184e+00
cityLondon:lat -5.336878e-01
cityRome:lat 1.792371e-01
cityLondon:weekend_weekdayWeekend -1.475659e+01
cityRome:weekend_weekdayWeekend -6.801898e+00
airbnb_full_OLS <- lm(realSum ~ city * ., data = airbnb_train)
# compare full model and ridge model
airbnb_coefs <- cbind(
Full_OLS = as.vector(coef(airbnb_full_OLS)),
Ridge_min = as.vector(ridge_coefs)) %>%
round(4) %>% as.data.frame()
#airbnb_coefs
airbnb_test_pred_full_OLS <- predict(airbnb_full_OLS, newdata = airbnb_test)
head(airbnb_test_pred_full_OLS)
- 1: 156.164828357232
- 2: 140.254735874284
- 3: 151.113978200912
- 4: 156.237766532104
- 5: 392.887166767934
- 6: 179.53498815915
# compute RMSE of the full predictive model
airbnb_test_RMSEs <- tibble(
Model = "OLS Full Regression",
RMSE = mltools::rmse(
preds = airbnb_test_pred_full_OLS,
actuals = airbnb_test$realSum
)
)
airbnb_test_RMSEs
| Model | RMSE |
|---|---|
| <chr> | <dbl> |
| OLS Full Regression | 245.388 |
# compute RMSE of ridge (with best lambda) model
airbnb_test_pred_ridge <- predict(ridge_final, newx = airbnb_X_test)
airbnb_test_RMSEs <- rbind(
airbnb_test_RMSEs,
tibble(
Model = "Ridge Regression with minimum MSE",
RMSE = mltools::rmse(
preds = airbnb_test_pred_ridge,
actuals = airbnb_Y_test)
)
)
airbnb_test_RMSEs
| Model | RMSE |
|---|---|
| <chr> | <dbl> |
| OLS Full Regression | 245.3880 |
| Ridge Regression with minimum MSE | 244.9649 |
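For reference, the RMSE reported in this table is the square root of the mean squared difference between predictions and actual prices. A base R equivalent is sketched below on made-up numbers. The function name `rmse_base()` is hypothetical and not part of any package.

```r
# Root mean squared error: sqrt of the average squared prediction error
rmse_base <- function(preds, actuals) {
  sqrt(mean((preds - actuals)^2))
}

# Toy predictions and actual values (made up for illustration)
toy_preds <- c(100, 200, 300)
toy_actuals <- c(110, 190, 330)

toy_rmse <- rmse_base(toy_preds, toy_actuals)
toy_rmse
```

Here the errors are -10, 10 and -30, so the result is sqrt(1100 / 3), about 19.15. Because errors are squared, a few very expensive listings can dominate the RMSE.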
Analysis¶
I tried Ridge regression because the VIF values showed very high multicollinearity, and I expected the Ridge penalty to stabilise the coefficient estimates. However, the test RMSE of the Ridge model (244.96) is nearly identical to that of the full linear regression model (245.39), so the penalty did little for predictive performance. Ridge shrinks coefficients but keeps every correlated predictor in the model, so the multicollinearity itself remains. Because Ridge doesn't select and remove any variables, methods that do, such as Lasso or stepwise selection, might be better options.
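As a sketch of the Lasso option mentioned above, the same `cv.glmnet()` call can be used with `alpha = 1`, which allows coefficients to be set exactly to zero. The example runs on simulated data in which only two of ten predictors matter. The data, the seed, and the names `x1` to `x10` are assumptions for illustration, not results for the Airbnb model.

```r
library(glmnet)

set.seed(123)  # arbitrary seed so the simulated example is repeatable

# Simulated design matrix with 10 predictors; only x1 and x2 affect y
n <- 200
x <- matrix(rnorm(n * 10), nrow = n,
            dimnames = list(NULL, paste0("x", 1:10)))
y <- 3 * x[, "x1"] - 2 * x[, "x2"] + rnorm(n)

# alpha = 1 selects the Lasso penalty (alpha = 0 was Ridge)
cv_lasso <- cv.glmnet(x = x, y = y, alpha = 1)

# Coefficients at the lambda with the lowest cross-validated error
lasso_coefs <- as.matrix(coef(cv_lasso, s = "lambda.min"))

# Names of the predictors whose coefficients were not shrunk to zero
selected <- setdiff(rownames(lasso_coefs)[lasso_coefs[, 1] != 0], "(Intercept)")
selected
```

For the notebook's data, `airbnb_X_train` and `airbnb_Y_train` would take the place of `x` and `y`, and the test RMSE could be appended to `airbnb_test_RMSEs` in the same way as for Ridge.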